What is the formal NLP term for matching text spans with variations, and what are the recommended approaches?

edenyin · May 30, 2025, 6:53am

I’m implementing a document analysis system that needs to locate specific text segments within larger documents. Given a reference text snippet, I need to find where this content appears in the original document (its span), even when there are slight differences in formatting, punctuation, or wording.

I’d like to know:

1. The formal NLP/IR terminology for this type of task. Is this considered “approximate string matching,” “span detection,” or something else? Having the correct terminology will help me research existing literature and solutions. I’ve looked into “span detection” and “span extraction,” but they don’t seem to fit my scenario well: the work I found focuses on biology or on other NLP tasks such as emotion extraction or Named Entity Recognition.
2. Recommended approaches for solving this specific problem.

Mdrnfox · May 30, 2025, 12:28pm

I think you may be referring to Approximate String Matching, Span Passage Alignment, or passage-level retrieval. Those terms should get you started.
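To make the approximate string matching idea concrete, here is a minimal sketch using only Python's standard library. It slides word windows of roughly the snippet's length over the document and keeps the window with the highest `difflib.SequenceMatcher` ratio. The function name `find_fuzzy_span` and the `slack` parameter are made up for illustration, and the brute-force loop is only practical for short documents.

```python
import re
from difflib import SequenceMatcher


def find_fuzzy_span(document: str, snippet: str, slack: int = 2):
    """Return (start, end, score): character offsets of the best-matching window.

    Hypothetical helper for illustration; `slack` is how many words the
    window may be shorter or longer than the snippet.
    """
    # Character offsets of every whitespace-delimited token in the document.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", document)]
    target = " ".join(snippet.lower().split())
    n_words = len(target.split())
    best = (0, 0, 0.0)
    for size in range(max(1, n_words - slack), n_words + slack + 1):
        for i in range(len(tokens) - size + 1):
            start, end = tokens[i][0], tokens[i + size - 1][1]
            # Normalise case and whitespace before comparing.
            window = " ".join(document[start:end].lower().split())
            score = SequenceMatcher(None, target, window).ratio()
            if score > best[2]:
                best = (start, end, score)
    return best


doc = "Intro text here. The quick, brown fox jumps over the lazy dog! More text follows."
start, end, score = find_fuzzy_span(doc, "the quick brown fox jumped over the lazy dog")
print(doc[start:end], round(score, 2))
```

The returned offsets point back into the original, unnormalised document, so the span can be highlighted directly. For larger inputs you would probably narrow down candidate passages first and only run the alignment on those.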

You will probably come across techniques such as TF-IDF, BM25, and dense embeddings.
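As a rough illustration of the retrieval side, here is a small pure-Python BM25 scorer that ranks candidate passages against the reference snippet. It is a sketch under assumptions: the helper names `tokenize` and `bm25_rank` are invented here, the IDF is the non-negative `log(1 + ...)` variant, and `k1=1.5`, `b=0.75` are just commonly used defaults. A real system would more likely use an existing search library.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"\w+", text.lower())


def bm25_rank(passages, query, k1: float = 1.5, b: float = 0.75):
    """Return [(passage_index, score), ...] sorted by descending BM25 score."""
    if not passages:
        return []
    docs = [Counter(tokenize(p)) for p in passages]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    n_docs = len(docs)
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in d)
    query_terms = set(tokenize(query))
    scored = []
    for idx, (d, dl) in enumerate(zip(docs, lengths)):
        score = 0.0
        for term in query_terms:
            tf = d[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
        scored.append((idx, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


passages = [
    "The cat sat on the mat.",
    "Payment is due within thirty days of the invoice date.",
    "The weather was sunny.",
]
print(bm25_rank(passages, "payment due 30 days after invoice")[0])
```

BM25 only narrows things down to a passage. It relies on shared words, so it tolerates punctuation and formatting changes but not paraphrases, which is where dense embeddings tend to come in.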

Hope this helps

Needabiggermachine · May 31, 2025, 5:37am

Grep? Or other regular expressions?
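If the differences are limited to formatting and punctuation, a regular expression can go a long way. A minimal sketch, assuming the words themselves are unchanged: build a pattern from the snippet's words and allow any run of non-word characters between them. `build_tolerant_pattern` is a name made up for this example.

```python
import re


def build_tolerant_pattern(snippet: str) -> re.Pattern:
    """Compile a regex that matches the snippet's words in order,
    ignoring case and any punctuation or whitespace between them."""
    words = re.findall(r"\w+", snippet)
    if not words:
        raise ValueError("snippet contains no word characters")
    # [\W_]+ matches one or more non-word characters (or underscores).
    body = r"[\W_]+".join(re.escape(w) for w in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


doc = "See Section 3:  Terms -- and\nConditions, below."
match = build_tolerant_pattern("terms and conditions").search(doc)
if match:
    print(match.span(), repr(match.group(0)))
```

This will not cope with changed wording (for example "jumps" versus "jumped"); for that you would need a fuzzy or embedding-based approach.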
